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Abstract 

A new class of parallel algorithms is introduced that can achieve 
a complexity of 0 (n 2 ) with respect to the interprocessor communi¬ 
cation, in the exact computation of systems with pairwise mutual 
interactions of all elements. Hitherto, conventional methods exhibit a 
communicational complexity of O(n^). The amount of computation 
operations is not altered for the new algorithm which can be formu¬ 
lated as a kind of /i-range problem, known from the mathematical 
field of Additive Number Theory. We will demonstrate the reduction 
in communicational expense by comparing the standard-systolic algo¬ 
rithm and the new algorithm on the connection machine CM5 and 
the CRAY T3D. The parallel method can be useful in various scien¬ 
tific and engineering fields like exact n-body dynamics with long range 
forces, polymer chains, protein folding or signal processing. 
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1 Introduction 


Within many scientific and engineering applications one is faced with 
the intermediate computation of bilocal objects, f{xi,Xj), on a given 
set of numbers Xi, i = 1,... ,n. Think for example of the exact treat¬ 
ment of 2-body forces in n-body molecular dynamics Q as employed 
in astrophysics or thermodynamics of instable systems, convolutions 
in signal processing Q , autocorrelations in statistically generated time 
series [^, n-point polymer chains with long-range interactions, protein¬ 
folding or fully coupled chaotic maps |^, . 

In the case of bilocal objects, symmetric in i and j, the entire com¬ 
putation of all f{xi,Xj) requires n{n — l)/2 combinations of elements. 
In the spirit of truly parallel processing, the 0{n^) complexity can in 
principle be reduced to 0{n) per parallel processor if the number of 
processors is increased as n, the number of bilocal objects. 

On such massively parallel computers with distributed memory, the el¬ 
ements Xi of the array x are spread out over many processing elements, 
i. e. local memories. For this reason such computations in general 
require considerable interprocessor communication. Any progress in 
reducing the ensuing communication overhead is therefore welcomej^. 

Various methods have been devised in the past to exactly solve the 
computational problem; we mention two generic parallel approaches^: 


• The replicated data method [P deals with n identical copie^ of 
the entire array x that are to be placed within each processor 
in advance. On these arrays, the computation is performed such 
that each processor i calculates the elements f{xi,Xj) for j = 


Tn molecular dynamics applications with long range forces, e. g., exact state-of-the-art 
calculations are restricted to a number of elements (particles) in the range of < 100.000. 
Exact computations are required near locations where physical systems exhibit a phase 
transition. In such applications, repeated computations of the elements f{xi,Xj) have to 
be performed with a number of steps in the range of 10^ to 10^ 0. 

^Parallel ‘linked-cells’ algorithms play the major role in molecular dynamics with short 
range forces p. 

^Here we assume that the numbers of processors, p, and array elements, n, are equal. 


The case p < n is discussed in section 4.7 
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• The parallel organization of the evaluation in form of a systolic 
array computation i i 0, 0 proceeds with one moving and 
one fixed array, that are distributed across the processing el¬ 
ements. In each systolic^ step the moving array is shifted by 
one position along the processing elements and subsequently, the 
computations are performed. After n — 1 steps, all pairs f{xi,Xj) 
have been generated. 


Various procedures which combine features of both above mentioned 
methods are known in the literature. Common to all such approaches 
is that both the computational complexity and the complexity of the 
interprocessor communication is O(n^). 


In this work, we introduce a parallel computing concept that, to a 
certain extent, can be regarded as a generalization of systolic array 
computations with ‘pulse’ and ‘circular movement’. We will point 
out that a formulation of the computational problem in the context of 
Additive Number Theory—leading to a kind of /i-range problem [14|— 
defines a new class of algorithms. Choosing the base of the /i-range 
problem suitably, one can recover the classical systolic realization of 
the computational problem. In addition, various other bases can be 
contrived that allow to diminish the complexity of the interprocessor 
communication. We will present a selection of extremal (shortest) 
bases and a ‘regular’ base that both reduce the complexity of the 
interprocessor communication to 0(n2). The regular base is nearly 
optimal and well suited for massively parallel machines. 


The new method, further denoted as ‘hyper-systolic’, has in common 
with the standard-systolic array computation that only one array is 
moving in each step in a regular communication pattern. Therefore, 
it is well suited for SIMD and MIMD machines as is the case with 
standard-systolic computations. The distinctive feature as compared 
to the standard-systolic concept, however, lies in the fact that storing 
of shifted arrays is required^, similar to the replicated data method. 

But in contrast to the latter where the required storage goes with n^, 

^In molecular dynamics, systolic array computations are known as Orrery-algorithm 

® Complexity in (computer) time is avoided at the expense of complexity in (storage) 
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a significant reduction in interprocessor communication is achieved 
while only a moderate additional amount of storage space is needed. 
To achieve the optimal hyper-systolic speed-up, the required storage 
goes with n 2 . We will demonstrate that, in view of the state-of- 
the-art problem sizes mentioned in footnote ||, our method can be of 
considerable practical importance. 

In section we present the generic form of the computational prob¬ 
lem to solve, in section |3|, we review the standard-systolic concept. In 
order to provide a graphical illustration, we will deal with systolic au¬ 
tomata models |[^, |^. A suitable generalization of the latter which 
gives us the means to graphically represent the hyper-systolic parallel 
computing structure is worked out in section In section we dis¬ 
cuss implementation issues concerning the hyper-systolic method on 
parallel machines and we compare standard-systolic computations on 
the Connection Machine CM5 with their hyper-systolic counterpart. 
Furthermore we present results from measurements on the Cray T3D. 
Finally, we summarize and give an outlook. 


2 The Computational Problem 


The generic computational task we deal with in 
calculation of 

the following is the 

n 

Vi ■=^ fixi,Xj), i = 1,2,3,... ,n, 

i=i 

* / j, (1) 

with the input array (sequence) 


X = {xi,X2,X3, ...,Xn) 

(2) 

and the resulting array 


y = {yi,y2,y3, • • ■,yn)- 

(3) 


space. 
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The number of combinations {xi,Xj} required for the computation of 
all f{xi,Xj) symmetric in i and j is 

n\_ n{n — 1) 

2 ) ~ 2 ’ 

i. e., the computational complexity inherent to Eq. is 0{'n?). 

In molecular dynamics terms, the sequence x can e.g. be thought of 
as the coordinates of n particles, where the sequence y describes the 
potential (or equivalently a component of the force) which particle 
i is subject to in the presence of all the other particles. 


3 Standard-Systolic Computation 


We first review the systolic realization of Eq. [I| from which our further 
considerations start. We use the concept of systolic automata models 


to illustrate parallel computing structures [IT 


A systolic automaton consists of cells with data transfer and process¬ 
ing units, see Eig. |^, that are arranged in a regular grid. The cells are 
coupled to each other in a uniform next-neighbour wiring pattern as 
depicted in Fig. The processing elements (pe) are drawn as open 
triangles. They realize functions of equal load between consecutive 
communication events. The data transfer elements are drawn as black 
squares. They are delay elements (de) that represent abstractions for 
memory locations in which data is shifted in and out in regular clock 
step^. Data processing and transfer are completely pipelined|]. 

The graphical structural counterpart of Eq. is given in Fig. The 
cells are consecutively arranged in a linear order. Each cell is con¬ 
nected with its left and right next neighbours. Note that the systolic 
computation of Eq. || leads to a toroidal topology of the linear array. 

®By ‘clock step’ we denote the clock step of the abstract automaton rather than a 
physical computer clock step. 

^ Precise definitions of systolic arrays and algorithms along with many examples and 
applications can be found in two monographs by N. Petkov 0,0 
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Figure 1: Processing element (open triangle) and delay element (black 
square), arranged in a systolic automaton cell (large dotted square). The 
element Xj resides stationary in the processing element. In each clock step 
the delay element delivers another data element Xj of the moving array. The 
function f{xi,Xj) is computed and successively added to Ui that is resident 
in the processing element as well. 

At clock step 0, a sequence x of n data elements is distributed over 
n processing elements. This sequence will stay fixed in the following 
process. Initially a copy x of the array x is made. The array x is 
shifted and the processing on all cells is performed subsequently clock 
step by clock step. In Fig. the situation is shown for the first clock 
step. Fig. ^ illustrates the state of the systolic computation after 3 
clock steps. 

The small black boxes stand for the de’s, the function of which is to 
shift the data element Xi in one clock step from cell ^ i to cell ^ 

(i + 1). Subsequently, the pe’s perform the numerical computation of 
the function /, i.e., they link the elements of the fixed array x with 
the elements of the shifted array x. The result in the Tth cell is added 
to the resulting data element y* of the array y such that after re — 1 
steps, the resulting data elements are distributed over all processing 
element^. 

®In terms of the molecular dynamics example, each particle coordinate is located to¬ 
gether with its respective force value in the same processing element. 
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Figure 2: Systolic array of abstract automaton cells with uniform next 
neighbour connections in toroidal topology. 




+f(x,,xp 


+f(x,,xp 


+f(X3,xP 


y„,=f(x„_,,x„) y„=f(x„,x,) 

+f(x„, ,x,) +f(x_^,x,) 

+f(x„ , ,X,) +f(x_^,X3) 


Figure 3: Data location of the systolic array after the third clock step. 


The mapping of the abstract automaton model onto a real parallel 
computer is straightforward: the pe’s are identified with the proces¬ 
sors of the parallel machine, the de’s symbolize registers or memory 
locations exchanging data via the interprocessor communication net¬ 
work. Of course, for the parallel machine a communication network 
is required that allows to realize the above considered particular com¬ 
munication pattern with a toroidal topology efficiently. This is the 
case for the most SIMD and MIMD parallel machines, see the imple¬ 
mentation of the hyper-systolic algorithm described in section ||. 

The complexity of the systolic computation of Eq. || is of order v? 
for both the data transfer operations and the processing operations, 
i. e. exactly n(n — 1) processing operations and communication events 
were required if the method would be applied to the full array with n 
elements. One can improve on the scaling of the systolic realization 
taking into account the symmetries. The computations can be reduced 
to n{n — 1)/2 processing operations but now n(n -|- 1) communication 
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events have to be performed, since also the result array has to be 
shifted n/2 times and finally one reverse shift must be carried out. 
This method, called ‘Half-Orrery’ algorithm, has been introduced in 
a recent paper by Scholl and Alimi |^]. The limiting behaviour of 
O(n^), however, is the same as before, for both communication and 
CPU operations. 


4 The Hyper-Systolic Algorithm 


The main message of this paper is that the total time expenditure 
for the interprocessor communication in the computation of Eq. 
can be reduced considerably as it is possible to achieve a complexity 
of 0(n2) by a suitable re-organization of the data-movement for the 
communicational part of Eq. [T|, at a moderate expense of storage space. 
The amount of processing operations is not altered. 


4.1 Motivation 


Consider the matrix C which explicitly enumerates the data elements 
of the shift-array xt for the clock steps t = 0,1, 2,... , n—1, as delivered 
successively in the systolic implementation of section 


/ 

1 

2 

3 .. 


re 

\ 

o 

II 


n 

1 

2 .. 

re — 2 

re — 1 


t=l 


re — 1 

re 

1 .. 

re — 3 

re — 2 


t=2 








II 

CO 

V 

2 

3 

4 .. 

re 

1 

) 

t=n-l 


We observe in (^) that all combinations of different elements actually 
occur n times, if every shifted array is stored for t = 0,1, 2,... , n — 1 
and combinations between all the rows are allowed. Note that storing 
the shifted arrays is in contrast to the systolic concept where only one 
hxed and one moving array came into play. Here, as time proceeds, an 




increasing number of ‘resident’ arrays is needed, with only one array 
in movement. 


The matrix C shows that an n-fold redundancy of pairings is encoun¬ 
tered if all shifted arrays are stored. It appears natural to reduce this 
redundancy of pairings and optimize by storing a few arrays only, 
for a suitably selected subset of time steps t. 


To illustrate the idea, let us consider a simple example for n 
from matrix C in Eq. ^ we take five rows: 


C' 


/I 2 3 4 5 6 
16 1 2 3 4 5 
14 15 16 1 2 3 
12 13 14 15 16 1 
ys 9 10 11 12 i; 


7 8 9 10 11 

6 7 8 9 10 

4 5 6 7 8 

2 3 4 5 6 

14 15 16 1 2 


12 13 14 15 16 \ 
11 12 13 14 15 
9 10 11 12 13 
7 8 9 10 11 
3 4 5 6 7 ; 


16; 


( 6 ) 


It suffices to discuss in C' combinations with element 1 only, because 
of the toroidal topology of the systolic array. In the first column the 
combinations {1,16}, {1,14}, {1,12} and {1,8} can be found, in the 
second column {1,2}, {1,15}, {1,13} and {1,9} occur, in the fourth 
column we have {1,3}, {1,4} and {1,11}, in the sixth column we have 
{1,5} and {1,6} and in the tenth column the yet missing combinations 
{1,10} and {1, 7} are present. 


Thus, in order to generate all pairs for re = 16 only 5 stored arrays are 
required. 

We note further that to realize Eq. |^, we cannot just add the outcome 
of the computations to one resulting array: for every row Xf to be 
stored, a separate resulting array is required. These arrays have to 
be shifted back in steps inverse to the foregoing. Then they can be 
added to give the final resulting array y. 


Therefore, for this small example, in total 8 shift-operations are re¬ 
quired. In the standard systolic process 15 shifts of the moving array 
must be carried out. Using the symmetry in i and j, in total, 17 shifts 
are to be performed (Half-Orrery algorithm |l^), because in this case 
the result array must be transported too, and finally shifted back. The 
array length 4 represents the cross-over between standard-systolic and 
the new way to compute Eq. concerning the number of array shifts. 


9 




Going to larger arrays, the new algorithm becomes more and more 
advantageous. 


4.2 The algorithm 


We now give a general prescription to compute Eq. according to 
the ideas exposed in the last section. The new algorithm is called 
hyper-systolic, the name referring to the topological features of the 
underlying communicational structure: 

1. For a given array x of length n, k copies xj are generated by 
shifting the original array x k times by a stride at and storing 
the resulting arrays as x*, 1 < t < /c. The variable stride at of 
the shifts must be chosen such that 

(a) all pairings of data elements occur at least once, 

(b) the number of equal pairings is minimized, 

(c) the number of shifts k is minimized, 

(d) the number of different strides at is minimized. 

2. The required n(n — l)/2 results yi according to Eq. are succes¬ 
sively computed and are added to k resulting arrays y^. 

3. Finally, applying the inverse shift sequence, the arrays yt are 
shifted back and corresponding entries are summed up to build 
the elements of the final array y. 

In the above defined hyper-systolic parallel computing structure, in 
general, 2k intermediate arrays are required. Later, we will come to 
special realizations that work with less intermediate arrays. 


4.3 Graphical representation 

The concept of systolic automata models can be transferred to the 
hyper-systolic parallel computing structure. For the graphical repre¬ 
sentation we use the objects introduced in section |3[ Fig. ^ shows 
the i-th hyper-systolic cell unit. The de’s again represent input and 
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Figure 4: Processing element (open triangle), delay element (black square) 
and store elements (se) (open squares), arranged in a hyper-systolic automa¬ 
ton cell (large dotted square). The element x, resides stationary in the storing 
element. First, in each clock step t, the delay element delivers a new data 
element Xi^at that is both stored in the local se and transmitted further. 
Subsequently the computing is performed on the relevant combinations of 
yet stored elements. The results are stored in the corresponding memory 
locations yi+at- Note that each de is equipped with r connectors to account 
for the r different strides Second, the results yi+at have to be successively 
shifted back, using a stride sequence inverse to the above, and added up to 
yield the hnal result. 


output. Additionally we use open rectangles to symbolize the mem¬ 
ory locations for the stored data elements Xi+at and yi+af, 1 < t 
1 < z < n. Thus we account for the original systolic functionality of 
the de’s acting on transient data elements only|^. The de’s now pos¬ 
sess 1 input and r output connectors, one separate connector for each 
required shift width at- In general, the number of connectors r is not 
equal to k. It depends on the chosen hyper-systolic scheme. 

Again the cells are arranged in a uniform grid; in addition to the 
next-neighbour wiring, the new connections between the processors 
are drawn in Fig. They correspond to the shifts of data elements 

®We note that the index z -|- at is to be taken as z -I- at — n for z -|- at > n (or mod(z -|- 
Ot — 1, n) -I- 1) because of the toroidal topology of the hyper-systolic array. 
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Figure 5: 3 abstract automaton cells as part of a hyper-systolic array with 

r = 2 uniform next neighbour connections and toroidal topology. In this 
simplihed case, the two strides are of length 1 and 2. 

by strides at- According to the number r of different strides at, r direct 
connections would be optimal. Prom the point of view of the origi¬ 
nal one-dimensional systolic array structure—due to these additional 
communication connections—the (one-dimensional) space of process¬ 
ing elements has acquired topologically non-trivial interconnections. 

Fast interprocessor data transfer to distant cells can be achieved in 
one clock step via these interconnections that are used as shortcuts. 

Thus the characterization of our new algorithm as hyper-systolic. 

The pe’s now don’t realize functions of equal load between consecu¬ 
tive communication events. The amount of processing operations g{t) 
increases with an increasing number t of stored arrays y^/, 1 < F < t. 

We emphasize again that hyper-systolic data movement and storing 
extends the standard-systolic concept. 

Fig. I indicates the graphical structural counterpart of Eq. || for the 
hyper-systolic algorithm; we have plotted the situation for t = 1. At 
clock step t = 0, a sequence x of n data elements is distributed over 
n processing elements. In every clock step t = 0,1,2,..., k, a copy 
of the array Xi_i is generated and, in the next clock step, the array 
is shifted by a stride a*. Subsequently, the new array x^ is stored in 
the se’s. Then the processing can be done by the pe’s involving ever 
more of the stored arrays for increasing t. Fig. ^ shows the situation 
after 2 clock steps. Note however that only one array is moving as is 
the case in the standard-systolic computation. 
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Figure 6: Data location on a part of the hyper-systolic array after the second 
clock step. 

In clock step t, the g{t) results in the hth cell are added to the cor¬ 
responding elements 1 < F < t. Finally, when n(n — l)/2 

processing operations have been performed, the k resulting arrays 
are shifted back by strides inverse to the above sequence and succes¬ 
sively added up to the result array y. 


4.4 Additive Number Theory and hyper-sys¬ 
tolic parallel processing 


As explained in section |4.2| , our task is to construct a sequence of 
strides = {oq = 0 , ai, 02 , 03 ,..., a^} that—in the course of the 
hyper-systolic process—provides us with k + 1 sequences xti t = 
0,1, 2,... , fc, represented as a rectangular matrix C' 




( X\-\-(2q X2-\-aQ ^3-\-aQ 

3^1+ai ^3-hai 


\ ^l+ak 372+afc 


Xn+ao \ 


^n+afe / 


(7) 


As mentioned above, in C all indices are understood as mod(i -|-ai — 
l,n) -|- 1. It is required that all combinations of elements from 1 to 

^°Note that we have introduced the zero-element. 
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n can be constructed along the columns of C. The integer k is to be 
chosen as small as possible, at is the stride by which xt-i is cyclically 
shifted to render xj. 


We formulate the algorithmic problem in terms of Additive Number 
Theory: 


Let I be the set of integers m = {0,1, 2 ... , n — 1} E NqjTi E N. 
Find the ordered set Ak = (oq = 0, oi, 02 , 03 ,, a^) E of /c + 1 
integers, with k minimal, where each m ^ I, (0 < m < n — 1), can be 
represented at least once as the ordered partial sum 


m — at + Uj+i + ... + Oj+j 


( 8 ) 


or as 


with 


m — n — (oj + Oj+i + ... + Oi+j), 
0 <i+J<fc, i,j €No. 


(9) 

( 10 ) 


This formulation with the extremal base A^. is reminiscent of the 
‘postage stamp problem’, a famous problem in Additive Number Theo¬ 
ry |H, W]. In contrast to the original postage stamp problem, here 
we deal with sequences A^ instead of sets and with ordered partial 
sums. In these terms we are looking for the /i-range n(/i. Ah) of the 
extremal basis Ah with only ordered partial sums of the elements of 
Ah allowed as given by Eqs. P, ^ and |l^. 


Unfortunately, our problem is just as little solvable in general as the 
postage stamp problem. Nevertheless, later we will propose an ap¬ 
proximate solution that carries optimal asymptotic scaling properties 
and is very well suited for the hyper-systolic realization on parallel 
SIMD and MIMD machines. 


The formulation in terms of an /i-range problem appears to provide 
a quite general framework covering a whole class of algorithms: any 
chosen base A induces its specific new algorithm. 

The realization of Aj. with k = n — 1 brings us back to the standard- 
systolic algorithm with the base 

A„_i = {0,ai,... ,an_i}, Oj = I V i = I,...,n - 1, (II) 
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that features a communicational complexity of O(n^). 

The realization 

Tl 

= {0,ai,... ,a^}, ai = l = (12) 

leads to the Half-Orrery algorithm if the result array is moved too. 

A lower bound on the minimal possible k, with optimal complexity, 
can be derived by the following consideration: 

The total number of combinations required is n{n — l)/2. Let the ma¬ 
trix C' of Eq. ^ be realized by k shifts, i. e. by k-\-l arrays. The number 
of possible combinations between the elements of a given column of 

. Thus, the following unequality holds: 

n{n -1) ^ Ik + l\ 

2 -[ 2 ) 

therefore, 

with k integer. Obviously, a sequence with less than kmin elements 
cannot solve the h-range problem presented above. In other words, the 
complexity for the interprocessor communication of any hyper-systolic 
algorithm can at best be n^/n. 



C follows as 



4.5 Regular bases 


In choosing a suitable base for the hyper-systolic algorithm, we bet¬ 
ter make use of the fastest links of our parallel architecture. So we 
have to meet a constraint: in general, from each processor there is 
a limited number r of fast connections to r other processors. This 
is, however, very much dependent on the particular parallel machine. 
On (hyper)cubic interconnection networks, e.g., r relates to the di¬ 
mension of the hypercube. We should prefer shift sequences that are 
restricted to exploit the r fast interprocessor connections available, in 
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the actual implementation of the hyper-systolic method on a parallel 
machine. To a certain extent, the network can provide shortcuts or 
‘wormholes’ through higher dimensions that allow to accelerate the 
standard-systolic movement along the one-dimensional array. 

In the case of the standard-systolic array, r = 1, because only next 
neighbour wiring is requested in the systolic process. This amounts 
to take local trains only for long-distance traveling. 


The next simple step would be a mix of intercity and local connections 
for the long-distance journey. In fact, a nearly optimal choice of a 
suitable sequence, leading to a very efficient implementation of the 
hyper-systolic ideas and requiring r = 2 interprocessor connections 
only, is given by: 

Ak=2K-i={lA,...A,K,K,...,K), K = ^, 

(15) 

V - - ^ ^ ^ ^ ' 

V ' ^ ^ ’ V 

K K -1 

with K shifts by 1 and K — 1 shifts by K. The length of the base is 
k = 2K+1. It can easily be shown that such a base leads to an h-range 
of n with respect to Eqs. and |9|. The communicational complexity of 
the hyper-systolic method using the regular shift sequence with shift 
constant K, i. e. strides Of = 1 for t < K and at = K ioi t > K, turns 
out to be 2n{2^/nj2 — 1). Note that we have included an additional 
factor 2. This is due to the fact that the number of required reverse 
shifts the result arrays are subject to also is given by 2^/ri/2 — 1. 


In order to illustrate the hyper-systolic array generated by the regular 
base let us plot an example where n = 32 and K = = 4 m Fig. 


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 

32 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 

31 32 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 

30 31 32 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 

29 30 31 32 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 

25 26 27 28 29 30 31 32 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 

21 22 23 24 25 26 27 28 29 30 31 32 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 

17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 


Figure 7: Matrix C’ for n = 32 using the regular base Aj = (1111444). 


Building combinations along the columns, all pairings of the numbers 
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1 to 32 between the different rows can be found at least once. Note 
that the last row requires a separate treatment as, only in that case, 
the pairs occur twice. 

Obviously, K = makes only sense for integer values K. Therefore, 
^ should be a square number. However, if this is not possible, one can 
switch over to a suitable neighbouring integer value K ^ as shift 
constant. 


4.6 Shortest bases 


The regular bases just introduced are well suited for massively-parallel 
machines with a large number of processors. 

Let us consider the ratio R between standard- and hyper-systolic com- 
municational complexity, the “gain-factor”. 



On a machine with 32 processors, e.g., we can gain a factor of 2 in 
interprocessor communication events between standard-systolic and 
hyper-systolic computations. This ratio is increasing rapidly going to 
machines with larger numbers of processors and e.g. becomes 11 for 
1024 processors. 

On coarser grained computers with ^ of processors < 64, other bases 
than the regular type might be more advantageous. 

In absence of analytical solutions, for various numbers of processors 
p from 4 to 64 we have constructed well suited extremal bases by 
straightforward computation. We tried to prefer bases with elements 
being a power of 2, as these strides might lead to fastest interprocessor 
communication on most parallel machine’s networks. 

Table collects our selection of extremal bases. On a 32 node parallel 
machine, e.g.^ we can theoretically gain a factor R = 2.75 compared 
io R = 2 using the regular base. 
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p 

k 

Ak 

P 

k 

Ak 

4 

2 

1 1 

5 

2 

1 1 

6 

2 

1 2 

7 

2 

1 2 

8 

3 

1 1 2 

9 

3 

1 1 2 

10 

3 

1 2 2 

11 

3 

1 2 2 

12 

3 

1 2 4 

13 

3 

2 1 4 

14 

4 

1114 

15 

4 

1114 

16 

4 

12 2 4 

17 

4 

112 8 

18 
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1114 4 

21 

4 

3 10 2 5 

22 

5 

1114 4 

23 

5 

1114 4 

24 

5 

1114 8 

25 

5 

112 8 8 

26 

5 

1 2 3 4 4 

27 

5 

1 2 2 8 8 

28 

6 

111444 

29 

6 

111444 

30 

6 

111444 

31 

6 

111444 

32 

6 

111448 

36 

6 

3 6 1 8 4 12 

48 

7 

3 1 11 7 16 4 4 

64 

8 

1 1 12 3 10 8 20 4 


Table 1: Shortest bases for various machine sizes. 


For p > 32 it becomes increasingly difficult to find extremal bases with 
elements near a power of 2 that would be shorter than the regular base. 
For p = 64, however, if we do not require all elements being a power 
of 2 , the shortest base is only k = 8 elements in length whereas the 
regular base needs fc = 11 elements. 


Again we illustrate the hyper-systolic algorithm using an extremal 
base for n = 32, see Fig. 


123456789 10 11 

32 123456789 10 

31 32 123456789 
30 31 32 1 2 3 4 5 6 7 8 

26 27 28 29 30 31 32 1 2 3 4 

22 23 24 25 26 27 28 29 30 31 32 

14 15 16 17 18 19 20 21 22 23 24 


12 13 14 15 16 17 18 19 20 21 22 
11 12 13 14 15 16 17 18 19 20 21 
10 11 12 13 14 15 16 17 18 19 20 
9 10 11 12 13 14 15 16 17 18 19 
5 6 7 8 9 10 11 12 13 14 15 
123456789 10 11 
25 26 27 28 29 30 31 32 1 2 3 


23 24 25 26 27 28 29 30 31 32 
22 23 24 25 26 27 28 29 30 31 
21 22 23 24 25 26 27 28 29 30 
20 21 22 23 24 25 26 27 28 29 
16 17 18 19 20 21 22 23 24 25 
12 13 14 15 16 17 18 19 20 21 
4 5 6 7 8 9 10 11 12 13 


Figure 8: Matrix C’ for n = 32 using the extremal base Aq = (111448). 
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4.7 Coarse grained limit 


The number of processors p of a given parallel machine and the number 
of array elements n can be different. In order to adapt both the 
standard-systolic and the hyper-systolic concept to this situation, we 
can partition the array of n elements into ^ subarrays, such that each 
processor deals with ^ data elements. Each subarray (of p elements) is 
distributed across the p processors and can be treated by the standard- 
systolic or hyper-systolic method such that successively all required 
combinations between all n data elements can be constructed. 

To illustrate the situation, the two stored hyper-systolic arrays for 


p = 16 and n 

Fig. |. 

= 32 using the extremal base of Table |I| are 

depicted 

in 
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Figure 9: Two matrices C 
A 4 = (12 2 4). 

’ for 

n - 

= 32 

and 

- P -- 

= 16 

using the extremal 


Considering the complexity, in the standard-systolic (Half-Orrery) 
computation, n{p + 1) interprocessor data movements have to be car¬ 
ried out, whereas in the hyper-systolic case the number of interpro- 
cessor movements follows as 2n{2^fpj2 — 1). The improvement again 
can be described by the ratio R of Eq. |I^, 





(17) 


With n S> p it becomes important to compare the pure computation 
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time vs. the interprocessor communication time as the former scales 
as O(n^). If we take n very large the parallel machine turns back to a 
scalar machine, i. e., interprocessor communication becomes irrelevant 
for both standard-systolic and hyper-systolic computations. Of course 
there is a limit in cpu-power, and the latter determines the maximal 
possible n that can be computed in reasonable time. In practice, for 
that value of n, one has to determine the amount of interprocessor 
communication overhead compared to the pure computation in order 
to asses a substantial improvement of the hyper-systolic method vs. 
the standard-systolic method. 


5 Performance Tests and Results 


We will demonstrate in the following section that the hyper-systolic 
method, applied to the computation of Eq. || on massively parallel 
computers, is indeed superior to standard-systolic implementations in 
real life applications. In the realistic setting of an exact molecular dy¬ 
namics simulation with long range forces, we study the performance of 
standard- hyper-systolic n-body dynamics with mutual gravitational 
interactions on connection machines CMS with numbers of nodes in 
the range of 16 to 1024 and 320-node Cray T3D. 


5.1 Molecular dynamics with long-range for¬ 
ces 


The gravitational force F{xi) acting on the i-th particle within an 
ensemble of n particles of equal mass m = 1 reads: 


n^i) = E 




Xi — Xi 


CC 7 X 4 


j + h 


(18) 


where the gravitation constant is set to 1. The f{xi,Xj) = 

Eq. 1^ belong to the class of symmetric bilocal objects used in Eq. 
We deal with two dimensions, i.e.,Xi = ( • 
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In a molecular dynamics simulation, given a distribution of particles to 
which coordinates and momenta are assigned at time step T = I, the 
long range forces between the particles are calculated with respect to 
Eq. and subsequently, the new positions and momenta at time step 
T = 1 + 1 are found using a suitable integration method like, e.g., the 
leap frog scheme]^ |]^ . The time-consuming part of the simulation is 
the computation of Eq. |l^. In order to improve on the interprocessor 
communication we have to concentrate on the efficient parallel imple¬ 
mentation of the latter. As we here deal with two dimensions, two 
arrays of coordinates, x = describe the position of the given 

particle. The momenta do not enter the computation of Eq. 18. 


5.2 Implementation issues 


At this stage the individual features of the programming languages 
and network topologies are discussed. 


CM-FORTRAN |^] This parallel language supports the concept 
of ‘virtual’ processors. Erom the user’s point of view, it allows to em¬ 
ulate a virtual parallel machine with n virtual processors, whereas the 
real machine has p real processors. In principle, the user can write 
programs as if there would be a machine with n-processors at hand, 
and the compiler and the operating system take care of the addi¬ 
tional administrative overhead. CM-FORTRAN appears to be well 
suited for the straightforward implementation of a one-dimensional 
standard-systolic array. The hyper-systolic method, however, requires 
explicit control over the data distribution across the p real processors. 
Therefore, on a machine with p real processors, we cut the array into 
n/p parts and declare each array of p elements as data-parallel array. 

The interprocessor communication network on a connection machine 
CM5 is based on the fat-tree concept [^]. We expect that communi¬ 
cation times to nodes at a ‘distance’ of more than one processor scale 
logarithmically with the number of processors. A connection machine 

^^The integration is a purely local procedure, therefore no interprocessor communication 
is required in that step. 
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CMS meets the requirements of the hyper-systolic concept with reg¬ 
ular array shifts: for any available machine sizes ranging from 16 to 
1024 processors, the shift of a data element to a location at distance 
at = K = \/pj2 in the hyper-systolic array, is achieved nearly as fast 
as the shift to the next array element {at = 1) because the latency 
dominates the communication time. 


T3D A message passing parallel programming language, as e.g. 
PVM implemented on the Cray T3D, does not support virtuality. We 
set up the array of n data elements on p processors available in the 
straightforward way discussed above, cutting the array into n/p parts. 
Note that PVM is not the only way to program message passing on the 
T3D. We did, however, not employ the fast low-level communication 
routines that are available. 

The T3D’s communication network consists of a three-dimensional 
torus. Unlike the CM5’s network, colliding messages are possible on 
this network. If the T3D application is run on a small partition it 
cannot use the toroidal structure. However, we expect the latency for 
exchange of interprocessor messages to dominate the ‘node-hopping’ 
time and therefore only little influence of this effect should be observ¬ 
able. 


5.3 Performance results on CM5 and T3D 


On the CMS, we want to avoid any interference of the ‘VP-manage- 
ment’ with improvements due to the hyper-systolic computation. For 
our tests, we decided to work with n = p on CMS and CRAY T3D, 
i. e., the VP-ratio is 1 on the CMS. Therefore, any gain in performance 
should theoretically scale as given by Eq. MB 

We had access to the connection machine CMS at GMD/HLRZ, Ger¬ 
many, with partitions of 16, 32, and 64 nodes and the GMS at Los 
Alamos National Laboratories, USA, with 128, S12 and 1024 node 

We emphasize that with n = p, we don’t have the advantage of vectorizing capabilities 
or communication pipelining. 
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partitions. Furthermore, we were able to conduct performance mea¬ 
surements on the 320-node Cray T3D at Edinburgh Parallel Comput¬ 
ing Centre (EPCC) where we tested our algorithm using partitions of 
16, 32, 64, 128 and 256 nodes. 

Table ^ gives the hyper-systolic shift constant K as a function of the 
number of processors and the number of shifts for the moving array 

X. 


p 

16 

32 

64 

128 

256 

512 

1024 

K 

2 

4 

6 

8 

12 

16 

24 

k 

4 

7 

11 

15 

23 

31 

47 


Table 2: Shift constants K for the regnlar shift seqnence as fnnction of the 
nnmber of processors. 


We plotted the actual interprocessor communication time (ipc) and 
the cpu-time for the computation of Eq. in one molecular dynamics 
step as function of the number of processors p = n in Eig. [l^ , both for 
the standard-systolic and the hyper-systolic algorithm. As indicated, 
the lines connect results from different partitions of one and the same 
machine. Fig. [T^ is a logarithmic replot of Fig. 10 to exhibit scaling 
behaviour. The figure shows equal scaling of all curves except the 
curve showing the time for the hyper-systolic communication. We 
note that the hyper-systolic computation appears to need slightly less 
cpu-time than the systolic computation. This is probably due to a 


slightly improved pipelining of the cpu-operations, see footnote ^ 
page 


on 


It turns out that the standard-systolic computation—for p = n—is 
dominated by interprocessor communication on both the connection 
machines CM5 and the CRAY T3DP^ On the T3D, the effect of the 

^^Note that we have chosen a realistic setting of the gravitational force, leading to the 
compute intensive calculation of a square root for every coordinate difference. In our test 
model, the system is not given a boundary condition, and we have cut off the potential 
for very small separations of particles, as is necessary in realistic simulations. 
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Figure 10: Interprocessor communication time (ipc) and cpu-time for the 
computation of Eq. |^in one molecular dynamics step as function of the num¬ 
ber of processors p = n. The lines connect results from different partitions 
of one and the same machine. 
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Figure 11: Same plot as before using a logarithmic y-scale. 






hyper-systolic computation being less cpu-time intensive is much pro¬ 
nounced due to the time consuming caching. Most of the time is spent 
with load and store operations between cache and local memory. 


Since both, the CPU-time required and the interprocessor communi¬ 
cation time should grow proportional to n^, the ratio between both 
values would be expected to vary only slightly for the standard-systolic 
method as is the case for the T3D. On the CM5, we have to take 
into account logarithmic corrections in interprocessor communication 
as explained later. The result is depicted in Fig. |l2|. We see that 
the interprocessor communication will become irrelevant on massively 
parallel machines, where the hyper-systolic computation of Eq. |T^ can 
show its full power. 


In Fig. H, we present the gain-factor R between the standard-systolic 
and hyper-systolic interprocessor communication times along with the 
theoretical prediction. In our particular implementation, we expect a 
scaling of the gain-factor according io R = v^p/2/3 in contrast to 
Eq. IT. The dashed curve represents this naive expectation. On the 


CMS, however, we have to take into account logarithmic corrections 
to the scaling law. This is due to the fat-tree hierarchy of its commu¬ 
nication network: communication times to processors in a ‘distance’ 
r increase logarithmically with r. Our findings demonstrate that the 
hyper-systolic algorithm will at any rate be superior to conventional 
methods: the interprocessor communication expense grows with nz 
as compared to v?. The full power of the method can be exploited on 
big massively parallel machines. 


The additional amount of memory needed in hyper-systolic computa¬ 
tions is tolerable in view of the present machine’s and problem’s sizes: 
even on the largest CM5 with p = 1024 we only need 2 x 24 x 1024 
words of additional memory. Of course, this has to be multiplied by 
n/p if we go to n > p. 
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Figure 12: Ratio of interprocessor communication time to cpu-time for the 
standard-systolic and hyper-systolic computation. 
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# of processors 


Figure 13: Gain-factor R between the standard-systolic and hyper-systolic 
interprocessor communication times. Two theoretically predicted curves 
are plotted. On the CMS, we take into account logarithmic corrections to 
vF^/3. 
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6 Discussion 


Summary We have presented a new parallel algorithm for the ef- 
hcient computation of systems of n elements with pairwise mutual 
interactions among all elements, denoted as hyper-systolic algorithm. 
We have developed the hyper-systolic array computation by general¬ 
izing the ordinary systolic realization of the computational problem. 
We illustrated the hyper-systolic algorithm in form of the canonical 
abstract automaton representation as it is known from systolic array 
computations. 

In the context of Additive Number Theory, the hyper-systolic method 
very naturally can be formulated as an /i-range problem n{h, A/j). We 
have discussed a solution of the latter problem which is well suited for 
the explicit parallel implementation of a hyper-systolic computation 
on SIMD and MIMD machines and nearly optimal with respect to the 
efficiency of the interprocessor communication. 

We applied our method to the computation of the molecular dynam¬ 
ics evolution of a system of n particles with long range forces. In the 
practical implementation on 16 up to 1024-node connection machines 
CMS and on 16 up to 256-node partitions of the CRAY T3D we could 
verify the expected theoretical improvement in interprocessor commu¬ 
nication as compared to the standard-systolic realization. 


Conclusions The hyper-systolic way of parallel computing can be 
extended to a variety of parallel computations which are usually done 
by systolic n^-loops. The potential of the method appears of great 
promise. It might even affect the very architectural issue of parallel 
machines, as it is efficiently applicable on massively parallel machines 
with simple (and therefore cheap) toroidal next-neighbour communi¬ 
cation networks |^l|. As an example, we will implement the hyper- 
systolic array computation on Quadrics (APE 100) machines produced 
by Alenia Spazio S.p.A. |21|. These machines are equipped with a sim¬ 
ple three-dimensional next-neighbour network. We expect the hyper- 
systolic computation to speed-up the Quadrics performance consider¬ 
ably 1^]. The full payoff of the idea will be achieved, of course, on 
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real massively parallel platforms such as described in p^ . 

We hope that the hyper-systolic approach of tackling n^-loop problems 
will provide an essential improvement for practical simulations. 
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